ROCm and HIP: A Complete Tutorial in 10 Chapters: The Memory Properties That Determine GPU Performance

In GPU acceleration, we have to abandon the "compute first" mindset. Modern performance is determined by Memory Management: controlling how data is allocated, synchronized, and moved efficiently between the host (CPU) and the device (GPU).

1. The Memory-Compute Imbalance

While GPU arithmetic throughput (TFLOPS) has risen sharply, memory bandwidth (GB/s) has grown far more slowly. This opens a gap in which the execution units often "starve," waiting for data from VRAM. As a result, GPU programming is often memory programming.
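
To make the gap concrete, the sketch below computes a machine balance: the number of FLOPs a device can execute in the time it takes to fetch one byte from memory. The function name and the peak figures are illustrative assumptions, not the specifications of any particular GPU.

```python
def machine_balance(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Return FLOPs the device can execute per byte it can fetch."""
    peak_flops = peak_tflops * 1e12          # TFLOPS -> FLOP/s
    bandwidth_bytes = bandwidth_gbs * 1e9    # GB/s   -> bytes/s
    return peak_flops / bandwidth_bytes


# Hypothetical device: 40 TFLOPS peak, 1600 GB/s memory bandwidth.
balance = machine_balance(40.0, 1600.0)
print(f"machine balance: {balance:.1f} FLOPs per byte")
```

A kernel that performs fewer FLOPs per byte than this balance cannot keep the arithmetic units busy on such a device, however well its math is written.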

2. The Roofline Model

This model relates Arithmetic Intensity (FLOPs/Byte), plotted on the horizontal axis, to attainable performance on the vertical axis. Applications usually fall into one of two categories:

Memory-bound: limited by bandwidth (the sloped part of the roof).
Compute-bound: limited by peak TFLOPS (the horizontal ceiling).
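
The two categories above can be sketched as a small calculation. The roofline bound is the minimum of the compute ceiling and the bandwidth slope. The helper names and hardware numbers below are assumptions chosen for illustration; the SAXPY figures (2 FLOPs and 12 bytes per float32 element) follow from its definition, y = a*x + y.

```python
def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute ceiling, intensity * bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def classify(intensity: float, peak_gflops: float,
             bandwidth_gbs: float) -> str:
    """Compare the kernel's intensity with the ridge point of the roof."""
    ridge = peak_gflops / bandwidth_gbs  # FLOPs/byte where the roof flattens
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device: 40000 GFLOPS peak, 1600 GB/s bandwidth.
PEAK, BW = 40000.0, 1600.0

# SAXPY in float32: 2 FLOPs per element; reads x and y, writes y (12 bytes).
saxpy_intensity = 2 / 12

print(classify(saxpy_intensity, PEAK, BW))
print(f"{attainable_gflops(saxpy_intensity, PEAK, BW):.1f} GFLOPS attainable")
```

Under these assumed numbers, SAXPY sits far to the left of the ridge point, so only raising its arithmetic intensity or the available bandwidth would lift its bound.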

3. The Data Movement Tax

The main performance bottleneck rarely comes from the math; it comes from the latency and energy cost of moving a single byte across the PCIe bus or from HBM. High-performance code prioritizes data residency and minimizes transfers between host and device.
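
A rough cost model shows why residency matters. The sketch below estimates the time to move a buffer as a fixed latency plus size divided by bandwidth. The link bandwidth, device bandwidth, and latency are placeholder assumptions, not measured values; real figures depend on the hardware and should come from profiling.

```python
def transfer_seconds(num_bytes: float, bandwidth_gbs: float,
                     latency_s: float = 0.0) -> float:
    """Estimate transfer time as fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / (bandwidth_gbs * 1e9)


buffer_bytes = 256 * 1024 * 1024  # a 256 MiB buffer

# Assumed figures: 32 GB/s host-device link with 10 microseconds of
# latency, versus 1600 GB/s on-device memory.
host_link_s = transfer_seconds(buffer_bytes, 32.0, latency_s=10e-6)
device_mem_s = transfer_seconds(buffer_bytes, 1600.0)

print(f"host-device copy: {host_link_s * 1e3:.2f} ms")
print(f"on-device read:   {device_mem_s * 1e3:.2f} ms")
print(f"ratio: {host_link_s / device_mem_s:.0f}x")
```

With these assumptions, each redundant round trip over the link costs tens of times more than reading the same data where it already lives, which is why keeping buffers resident across kernel launches pays off.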

QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.